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Abstract —In this paper, we consider a statistical problem of 
learning a linear model from noisy samples. Existing work has 
focused on approximating the least squares solution by using 
leverage-based scores as an importance sampling distribution. 
However, no finite sample statistical guarantees and no computa¬ 
tionally efficient optimal sampling strategies have been proposed. 
To evaluate the statistical properties of different sampling strate¬ 
gies, we propose a simple yet effective estimator, which is easy for 
theoretical analysis and is useful in multitask linear regression. 
We derive the exact mean square error of the proposed estimator 
for any given sampling scores. Based on minimizing the mean 
square error, we propose the optimal sampling scores for both 
estimator and predictor, and show that they are influenced by the 
noise-to-signal ratio. Numerical simulations match the theoretical 
analysis well. 

I. Introduction 

In many engineering problems, it is expensive to measure 
and store a large set of samples. Motivated by this, there 
has been a great deal of work on developing an effective 
sampling strategy for a variety of matrix-based problems, 
including compressed sensing (2), 0, adaptive sampling for 
signal recovery and detection |4), 0, 0, adaptive sampling 
for matrix approximation and completion 0, QD, ID and many 
others. At the same time, motivated by computational effi¬ 
ciency, statistically effective sampling strategies are developed 
for least-squares approximation flOl . least absolute deviations 
regression IfTTII and low-rank matrix approximation |[T2ll . 

One way to develop a sampling strategy is to design a 
data-dependent importance sampling distribution from which to 
sample the data. A widely used sampling distribution to select 
rows and columns of the input matrix are the leverage scores, 
defined formally in Section 2. Typically, the leverage scores 
are computed approximately es, on, or otherwise, a random 
projection is used to precondition by approximately uniformiz- 
ing them d, (B). ESI, OH, E3- A detailed discussion of 
this approach can be found in the recent review on randomized 
algorithms for matrices and matrix-based data problems ITSl l. 
Even though leverage-based sampling distributions are widely 
used, for many problems, it is unclear that why they might 
work well from a statistical perspective. 

For the problem of low-rank matrix approximation, 03 
provided an extensive empirical evaluation of several sampling 
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distributions and showed that iterative norm sampling outper¬ 
forms leverage score based sampling empirically. Ii20l showed 
that for low-rank matrix approximation, the square root of the 
leverage score based sampling statistically outperforms uniform 
sampling all the time and outperforms leverage score based 
sampling when the leverage scores are nonuniform. Further, the 
authors proposed a constrained optimization problem resulting 
in optimal sampling probabilities, which are between leverage 
scores and their square roots. 

For the problem of linear regression, the optimal sampling 
strategies for estimation and prediction are known to be 
obtained via A and G-optimality criteria, respectively ifZTj . 
but the resulting sampling strategies are combinatorial. The 
computationally efficient optimal sampling strategies are still 
unclear. Il22l showed that based on the sampled least squares, 
neither leverage score based sampling nor uniform sampling 
outperforms the other. The leveraging based estimator could 
suffer from a large variance when the leverage scores are 
nonuniform while uniform sampling is less vulnerable to small 
leverage scores. Il23l only analyzed the asymptotic behavior 
of the sampled least squares and the corresponding optimal 
sampling is still unclear. 

In this paper, we propose an estimator with computationally 
efficient sampling strategy that also comes with closed form 
and finite sample guarantees on performance. We derive the 
exact mean square error (MSE) of this estimator and show the 
optimal sampling distribution for both estimator and predictor. 
The results of the statistical analysis can be summarized 
as follows: (1) the proposed estimator is unbiased for any 
sampling distribution; (2) the optimal sampling distribution is 
influenced by the noise-to-signal ratio; (3) the optimal sampling 
distribution involves a tradeoff between the leverage scores and 
the square root of the leverage scores. We further provide an 
empirical evaluation of the proposed algorithm on synthetic 
data under various settings. This empirical evaluation clearly 
shows the efficacy of the derived optimal sampling scores for 
sampling large-scale data in order to learn a linear model from 
the noisy samples. 

Except for the simple statistical analysis, the proposed 
sampled projection estimator is useful for the multitask linear 
regression. This is because the computational cost is less 
when we estimate multiple coefficient vectors based on the 
same data matrix. This is crucial for many applications on 
sensor networks and electrical systems l24l . For example, 
traffic speeds are time-evolving data supported on an urban 


traffic network. Based on the same urban traffic network, it is 
potentially efficient to use the proposed estimator to estimate 
the traffic speeds at all the intersections from a small number 
of samples every period of time. The proposed estimator also 
can be applied to estimate corrupted measurements in sensor 
networks l25l . uncertain voltage states in power grids l26ll and 
many others. 

The main contributions are (1) we derive the exact MSE of 
a natural modification of the least squares estimator; and (2) 
we derive the optimal sampling distributions for the estimator 
and the predictor, and show that they are influenced by the 
noise-to-signal ratio and they are related to leverage scores. 

II. Background 

We consider a simple linear model 

y = X/?(°)+e, 

where y £ R™ is a response vector, X £ R nxp is a data matrix 
(n > p ), p^ £ R p is a coefficient vector, and e £ R™ is a zero 
mean noise vector with independent and identically distributed 
entries, and covariance matrix cr 2 1. The task is, given X, y, 
we aim to estimate p(°\ In this case, the unknown coefficient 
vector is usually estimated via the least squares as 

ft = argmin||y - X/3|| 2 = (X T X) -1 X T y = X 1 y (1) 

where X^ = (X T X) -1 X T £ R pxn . The predicted response 
vector is then y = Hy, where H = XX^ is the hat matrix. 
The diagonal elements of H are the leverage scores, that is, 11, , 
is the leverage of the zth sample. The statistical leverage scores 
are widely used for detecting outliers and influential data H27fl . 

ESI, Q3). 

In some applications, it is expensive to sample the entire 
response vector. Some previous works consider sampling algo¬ 
rithms computing the approximate solutions to the overcon¬ 
strained least squares problem flOl . lfl8l . ||29l , l22l . These 
algorithms choose a small number of rows of X and the 
corresponding elements of y, and use least squares on the 
samples to estimate p(°\ Formally, let M. = (.Adi, • • • , M m ) 
be the sequence of sampled indices, called sampling set , with 
|.Ad| = m and Aii £ {l,--- ,n}. A sampling matrix ^ is a 
linear mapping from R" to R m that describes the process of 
sampling with replacement, 

1 , 3 = Mp, 

0, otherwise. 

When the jth element of y is chosen in the zth random trial, 
then .Ad; = j and = 1. The goal is given X, we 
design the sampling operator to obtain samples Ty, and 
then, estimate P^ . Here we focus on experimental design of 
sampling operator, that is, the sampling strategy needs to be 
designed in advance of collecting any j/,. 

The choice of samples is an important degree of freedom 
when studying the corresponding quality of approximation. In 
this paper, we employ random sampling with an underlying 
probability distribution to choose samples. Let {7r,}" be a 
probability distribution, where tt, denotes the probability to 


choose the zth sample in each random trial. We consider 
two choices of the probability distribution: uniform sampling 
means that the sample indices are chosen from {1, 2, • • • , n} 
independently and randomly; and sampling score-based sam¬ 
pling which means that the sample indices are chosen from 
an importance sampling distribution that is proportional to a 
sampling score that is computed from the data matrijfl A 
widely-used sampling score is the leverage scores of the data 
matrix. Given the samples, one way to estimate p^ is by 
solving a weighted least square problem: 

P = argmin ||D ’Ey — D 'TX/?|| 2 

= (DT-X) t DT-y, (2) 

where D £ R mxm is a diagonal rescaling matrix with I),;, = 
1/ yjrmtj when 'f/ij = 1. This is called sampled least squares 
(SampleLS) 1221 . The computational cost of taking pseudo¬ 
inverse of D 'TX £ R mxp is 0(2p 2 m + p 3 ). The inverse term 
is involved with the sampling operator, but the computation 
is much cheaper than taking pseudo-inverse of X £ R nxp , 
which takes 0{2p 2 n + p 3 ) (n > m > p). However, it is 
hard to show the exact MSEs and the optimal sampling scores 
of SampleLS. f22l shows that, based on SampleLS, neither 
leverage score based sampling nor uniform sampling dominates 
the other from from a statistical perspective. 

III. Proposed Method 

To have a deeper statistical understanding of this task, we 
propose a similar, but simpler algorithm to deal with the same 
task, but it is easy to show its corresponding exact MSEs and 
the corresponding optimal sampling scores. 

A. Problem Formulation 

Similarly to (0, we estimate P^ by solving the following 
problem: 

P = argmin^ ||\k T 'I'D 2 \k T \l/y — X^U^ 

= X t T' T T-D 2 T' r T'y, (3) 

where D £ R nxn is the same diagonal rescaling matrix in (0. 
We call 0 sampled projection (SampleProj). This is akin to 
zero-filling the unobserved entries of y and rescaling the non¬ 
zero entries to make it unbiased. Comparing to the ordinary 
least squares, SampleProj does not benefit the computational 
efficiency because it still needs to taking pseudo-inverse of 
X £ R raxp and the computation involves the factor n; however, 
it is still useful when taking samples is expense. Comparing 
to SampleLS, this algorithm is more appealing when we aim 
to estimate multiple P^s based on the same data matrix X 
because the inverse term is not involved with the sampling 
operator. 

1 the terms of sampling distribution and sampling scores mean the same thing 
in this paper 



B. Statistical Analysis 

We obtain the exact mean squared error and an optimal 
sampling distribution for SampleProj. Our main contributions 
based on SampleProj can be summarized as follows: 

• the estimator is unbiased for any sampling scores 
(Lemma |T] ); 

• the closed-form solution (finite sample) on the MSE of 
the estimator and the predictor (Theorem © : and 

« analytic optimal sampling scores of the estimator and 
predictor are provided (Theorem©. 

Due to the lack of space, the full proofs of results appear 

in ID 


Lemma 1. The estimator /? in SampleProj is an unbiased 
estimator of that is, E f} = 


Lemma 2. The covariance of the estimator /3 in SampleProj 
has the following property: 


Tr(Covar [/? 


= E 


/3 — E 


P 




*=1 \z=i 


where a 1 is the variance of the Gaussian random noise. 


for experimentally designed sampling, we approximate before¬ 
hand. One way is to use the Cauchy-Schwarz inequality to 
bound Xi, 


(X/3 (0) )z = |xf^°>| < ||x*|| 2 pW 


An upper bound of the MSE of the estimator is then 


E 


p-p 


(0) 


< 


Tv(x'We<Xtf)-i 


?( 0 ) 


where (W q)i,i = (HxjHj ||/3 (0) ||2) /(rrnri). The corresponding 
sampling scores are 


—h 

—H 

11 ( x/ I2 /3(°) 2 + a2 

) 

X 

—h 

ll2(ll x Hl2 + | 

,.ieJ 


oc (X(X T X) -2 X T ) ;; ((XX T ) ; l + NSr) , (4) 

where NSR = cr 2 / ll/^ 0 ) 11^. 

11 11 ^ rp 

When the column vectors of X are orthonormal, X X is 
the identity matrix, the sampling scores © are between the 
leverage scores and the square root of the leverage scores. 


Combining the results of Lemmas © and [2] we obtain the 
exact MSE of both the estimator and the predictor. 

Theorem 1. Let /3 be the solution of SampleProj with sam¬ 
pling distribution of {77,;} " =1 . The mean squared error of the 
estimator /? is 


E 


•~P 


(o) 


Tr (x^ W c (X t ) T ') —— 
V /to 


( 0 ) 


where (Wq )i,i = ((X/?(°)) 2 + a 2 ) /(rrnri). The mean squared 
error of the predictor X /3 is 

X^-X/3 (0) ' 2 


E 


= Tr(HWc)-— X/3 (0) 


We next optimize over the mean squared errors and obtain 
the optimal sampling scores. 


Theorem 2. The optimal sampling score to minimize the mean 
squared error of the estimator (3 is 

m cx yj (X(X T X) -2 X T ) ^ ((xp(0))f + a 2 ); 

The optimal sampling scores to minimize the MSE of the 
predictor X j3 is 

t n ex yj H;^ ((X/?(°)) 2 + a 2 ). 

Theorem © shows that the optimal sampling scores depend 
on the signal strength and the noise. In practice, we cannot 
access to X /?(°) and we need to approximate the ratio between 
each (X/^ 0 ))/ and a 2 . For active sampling, we can collect 
the feedback to approximate each signal coefficient (X/?®)/; 


Corollary 1. Let the column vectors of X are orthonormal. 
When NSR —> +oo, the sampling scores © are the square 
root of the leverage scores, that is, ni = yTR/, for all l. When 
NSR —► 0, the sampling scores © are the leverage scores, that 
is, 7r; = H/^, for all l. 


An upper bound of the MSE of the predictor is then 


E 


X/3-X/3W 


< Tr(HW c ) - — 
m 


x/?(°) 


where (W c )z,z = II^H^ /( TO7r 0- The corresponding 

optimal sampling scores are 


77Z oc ^11X/11 2 ||/3(°) H 2 +cr 2 ) 

cx ^Hz.z ((XX t ) m + NSr). (5) 

When NSR goes to infinity, the optimal sampling scores 
are always the square root of the leverage scores for any X. 
When the column vectors of X are orthonormal, the sampling 
scores © are the same with © and are between the leverage 
scores and the square root of the leverage scores. 

Corollary 2. When NSR —y + 00 , the sampling scores © are 
the square root of the leverage scores, that is, 77 / = ^JWu, 
for all l. When the column vectors of X are orthonormal and 
NSR —y 0, the sampling scores © are the leverage scores, that 
is, 77/ = H/,/, for all l. 


IV. Empirical Evaluation 

In this section, we validate the proposed algorithm and 
statistical analysis on synthetic data. 
























































A. Synthetic Dataset 

We generate a 1000 x 20 data matrix X from multivariate t- 
distribution with 1 degree of freedom and covariance matrix 
X, whose elements are Xjj = 2 x 0.5l* -^1. The leverage 
scores of this matrix are nonuniform. This is the same as 
T\ in ll22l . We then generate a 20 x 1 coefficient vector /3°, 
whose elements are drawn from uniform distribution U(0 ,1). 
We test both noiseless and noisy cases. In the noiseless case, a 
response vector is y = X/3°; and in the noisy case, a response 
vector is y = X/3° + e, where e ~ Af(0,a 2 ). We vary a 
as 5,25,50,75,100. The corresponding ratios between ||e ||2 
and ||y ||2 are around 0.7%, 15%, 40%, 60%, 72% and the corre¬ 
sponding NSR, cr 2 / ||/3 (0) |j 2 , are around 4,100,400,800,1600. 


B. Experimental Setup 

We consider two key questions as follows: 

• do different sampling scores make a difference? 

• does noise make a difference? 

To answer these questions, we test the proposed SampleProj 
with different sampling scores for both noiseless and noisy 
cases. For each test, we compare 5 sampling scores, including 
uniform sampling as Uniform (blue), leverage scores as Lev 
(red), square root of the leverage scores as sql-Lev (orange), 
sampling scores for the estimator © as opt-Est (purple), and 
sampling scores for the predictor 0 as opt-Pred (green). For 
opt-Est and opt-Pred, we use the true noise-to-signal ratios. All 
the results are averaged over 500 independent runs. 

We measure the quality of estimation as 


err Es t 




where is the ground truth and j3 is an estimator, and we 
measure the quality of prediction as 


err Pred 


X/?-X/3(°) 

||x/?(°)|l 


C. Results 

Figure Q] compares the estimation error in both the noiseless 
and noisy scenarios of SampleProj as a function of sample 
size. We see that, as expected, the sampling scores for the 
estimator (0} outperform all other scores in both noiseless and 
noisy cases. Also, leverage scores perform well in the noiseless 
case, and square root of the leverage scores perform well in 
the noisy case, which matches the result in Corollary 0] 

Figure [2] compares the prediction error for noiseless and 
noisy cases of SampleProj as a function of sample size. We 
see that, the sampling scores for the predictor (0 result in the 
best performance in both the noiseless and noisy cases; leverage 
scores perform better than square root of the leverage scores in 
the noiseless cases. In Corollary [Q we see that square root of 
the leverage scores should perform better than leverage scores 
in a high noise case, thus, we suspect that the noise level is 
not high enough to see the trend. 



(a) Noiseless. 



Fig. 1: Comparison of estimation error for noisless and noisy 
cases of SampleProj in Task 1 as a function of sample size. 




Fig. 2: Comparison of prediction error for noisless and noisy 
cases of SampleProj as a function of sample size. 


To study how noise influences the results, given 200 samples, 
we show the estimation errors and the prediction errors of 
SampleProj as a function of noise level. Figure 0](a) shows that 
sampling scores for the estimator (0} consistently outperform 
the other scores. Leverage scores are better than square root 
of the leverage scores when the noise level is small, and 
square root of the leverage scores catch up with leverage scores 
when the noise level increases, which matches Corollary 0] 
Figure [3 (b) shows that sampling scores for the predictor 0 
consistently outperform the other scores in terms of prediction 
error; leverage scores are better than square root of the leverage 
scores when the noise level is small, and square root of the 
leverage scores catch up with leverage scores when the noise 
level increases, which matches Corollary [2] 

Now we can answer the two questions proposed in Sec- 
tion llV-Bl Based on SampleProj, the proposed sampling scores 
for the estimator consistently outperforms the other sampling 
scores in terms of the estimation error and the proposed sam¬ 
pling scores for the predictor consistently outperforms the other 
sampling scores in terms of the prediction error. Leverage score 
based sampling is better in the noiseless case, but the square 
roots of the leverage score based sampling is better in the noisy 
case. Note that many work in computer science theory typically 
does not address noise and mostly focuses on least squares ED, 
ED, QD, hence, from the algorithmic perspective, the leverage 
























Fig. 3: Comparison of estimation error and prediction error of 
SampleProj as a function of noise level. 

score sampling was considered optimal. 

In the experiments, we use the true NSRs for op¬ 
timal sampling scores for the estimator and the pre¬ 
dictor, which is impractical. In practice, for low noise 
cases, we set the NSR to zero and sample proportional 
to y/(X(X T X) -2 X T ) ;; (XX T )ij for the estimator and 

Hi ;(XX T ); j for the predictor; for high noise cases, we 

should sample proportional to f (X(X T X )~ 2 X T ) ( ~ for the 

estimator and ^/H ij for the predictor. In general, when X is 
non-orthonormal, these sampling scores are different than the 
leverage scores and the square root of the leverage scores. 

V. Conclusions 

We consider the problem of learning a linear model from 
noisy samples from a statistical perspective. Existing work on 
sampling for least-squares approximation has focused on using 
leverage-based scores as an importance sampling distribution. 
However, it is hard to obtain the precise MSE of the sampled 
least squares estimator. To understand the importance sampling 
distributions, we propose a simple yet effective estimator, 
called SampleProj, to evaluate the statistical properties of 
sampling scores. The proposed SampleProj is appealing for 
theoretical analysis and multitask linear regression. The main 
contributions are (1) we derive the exact MSE of SampleProj 
with a given sampling distribution; and (3) we derive the 
optimal sampling scores for the estimator and the predictor, 
and show that they are influenced by the noise-to-signal ratio. 
The numerical simulations show that empirical performance is 
consistent with the proposed theory. We have derived the opti¬ 
mal sampling strategy for a specific estimator, but identifying 
the optimal estimator with a computationally efficient sampling 
strategy remains an open direction. 
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Appendix 

A. Proof of Theorem [7] 

We first show the unbias of the estimator in Lemma Q] For 
each element in the bias term, we have 

( E K), 

= Ea 4 j£ j E (^)j^ + e)^ 

\MjGM 

= mEj ((X^,, —(X^ 0 )),) 

V m7 ti ) 

n i 

= TOV'(X t ),; i i-(X/3 (0) )/7ri 

ti mnl 

n 

= ^(XtjyPC^)), = (Vf)., 


/=! 


where (a) follows from that each sample is identically and 
independently draw from {7Tj}^_ 1 . 

We then show the covariance of the estimator. We first split 
0 — E f into two components, and then bound both one by 
one. 

(/?~ e [/?]). 

= E (Xt^.W^.M^X^+e)^.- (> 0) ). 
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MjEM 

= A“ + Af>. 

The first term captures variability due to sampling, while the 
second term captures variability introduced by noise. Since the 
noise is independent from the sampling set A4 , we can bound 
AW and A^ 2 ) separately. For A^ , we have 
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For the noise term A, , we have 
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We then combine both A| 1J and A- 2 \ and obtain the 
variance term. 
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Finally, we put the bias term and the variance term together 
































to obtain the exact MSE of the estimator. 
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B. Proof of Theorem Q] 

To obtain the optimal sampling scores for the estimator, we 
solve the following optimization problem. 

minTr(x t W c (X t ) T ) , 

subject to : E Kl = 1,7 Ti> 0. 
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The objective function is the upper bound on the MSE of the 
estimator derived in Theorem [j] and the constraints require 
{7r/}" =1 to be a valid probability distribution. The Lagrangian 
function is then 
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We set the derivative of the Lagrangian function to zero. 
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and then, we obtain the optimal sampling distribution to be 
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Similarly, we minimize Tr(HWc) to obtain the optimal 
sampling distribution for the predictor to be 


7 t ; cx ((Xp(°))f + a 2 ). 









































